A Formalism for Universal Segmentation of Text

Author

  • Julien Quint
Abstract

Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called universal as the formalism itself is independent of the language of the documents to process and independent of the levels of segmentation (e.g. words, sentences, paragraphs, morphemes) considered by the target application. This framework relies on a layered structure representing the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.

Similar resources

A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling

In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...

Spécification et réalisation d'un formalisme générique pour la segmentation multiple de documents textuels multilingues

The issue of word segmentation, or tokenization, is often treated as a trivial matter because of the use of separators in writing. The rise of the Internet and the Web led to the availability of millions of documents in countless languages, which in turn led to a renewed interest in multilingual applications. These applications rapidly showed the limitations of the simplistic approaches...

Publication date: 2000